Introduction

This is the fifth installment of Applying Machine Learning to Kaggle Datasets, a series of IPython notebooks demonstrating the methods described in the Stanford Machine Learning Course. In each notebook, I apply one method taught in the course to an open Kaggle competition.

In this notebook, I demonstrate k-means clustering using the Digit Recognizer competition.

Outline

  0. Functions to process the data
  1. Import and examine the data
  2. Cluster data into 10 categories (one for each digit)
  3. Evaluate model results
  4. Summary

Import Necessary Modules


In [17]:
import pandas as pd
import numpy as np
import sklearn.cluster as skc
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

0. Functions to Process Data


In [5]:
def ij2index(ii,jj):
    """
    Converts pixel indices ii (row) and jj (column)
    to a single value in the grid below:
    
         jj=0 jj=1 jj=2 jj=3   jj=26 jj=27
    ii=0  000  001  002  003 ... 026  027
    ii=1  028  029  030  031 ... 054  055
    ii=2  056  057  058  059 ... 082  083
           |    |    |    |  ...  |    |
    ii=26 728  729  730  731 ... 754  755
    ii=27 756  757  758  759 ... 782  783
    """
    
    # Number of columns per row
    nJ = 28
    return ii*nJ + jj
    
def index2ij(index):
    """
    Converts 1D index to 2D pixel indices 
    ii (row) and jj (column) from the grid below:
    
         jj=0 jj=1 jj=2 jj=3   jj=26 jj=27
    ii=0  000  001  002  003 ... 026  027
    ii=1  028  029  030  031 ... 054  055
    ii=2  056  057  058  059 ... 082  083
           |    |    |    |  ...  |    |
    ii=26 728  729  730  731 ... 754  755
    ii=27 756  757  758  759 ... 782  783
    """
    
    # Number of columns per row
    nJ = 28
    jj = index % nJ
    ii = (index - jj) // nJ
    return (ii, jj)

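A quick sanity check (my addition, not in the original notebook) confirms that the two helpers are mutual inverses:

In [ ]:
# Round-trip a few corner and interior indices through both helpers
for index in (0, 27, 28, 755, 783):
    assert ij2index(*index2ij(index)) == index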

1. Read Digit Data


In [10]:
data = pd.read_csv("./data/digits/train.csv")
data.head()


Out[10]:
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns


In [11]:
# Split up digit images and labels
targets = data['label']
digits = data.drop('label',axis=1)

In [16]:
# Plot one of the digits
plt.imshow(digits.loc[1000].values.reshape(28,28),cmap=cm.Greys,interpolation='none')


Out[16]:
<matplotlib.image.AxesImage at 0x11f70f990>
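To confirm that the image matches its label (my addition), the corresponding entry of targets can be printed alongside:

In [ ]:
# The true label for the digit plotted above
print targets.loc[1000]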

In [60]:
# Plot frequency of digits in the dataset
targets.hist()


Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x10862db10>

2. Cluster data into 10 categories

We use k-means to cluster the available greyscale images into 10 categories. I do not anticipate a clean correspondence between the resulting clusters and the 10 digits (0-9), because our method is neither scale-, translation-, nor rotation-invariant.

Nevertheless, let's try and see how it goes!
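For intuition, here is a minimal NumPy sketch (my illustration, not part of the original analysis) of the Lloyd iteration that k-means alternates: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The actual clustering below uses scikit-learn's optimized implementation.

In [ ]:
def lloyd_kmeans(X, k, n_iter=20, seed=1):
    """Toy k-means (Lloyd's algorithm) for a 2D numpy array X."""
    rng = np.random.RandomState(seed)
    # Initialize centroids from k randomly chosen rows of X
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared distance from every row to every centroid
        d2 = ((X**2).sum(axis=1)[:, None]
              - 2*np.dot(X, centers.T)
              + (centers**2).sum(axis=1)[None, :])
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned rows
        for jj in range(k):
            if np.any(labels == jj):
                centers[jj] = X[labels == jj].mean(axis=0)
    return centers, labels

# Example call (equivalent in spirit to the scikit-learn fit below):
# centers, labels = lloyd_kmeans(digits.values, 10)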


In [55]:
model = skc.KMeans(n_clusters=10,n_init=1,random_state=1)

In [56]:
model.fit(digits)


Out[56]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=10, n_init=1,
    n_jobs=1, precompute_distances=True, random_state=1, tol=0.0001,
    verbose=0)

In [68]:
output = model.predict(digits)

3. Evaluate Model Results


In [69]:
# Plot the center of each cluster returned from the k-means algorithm
for ii in range(10):
    plt.subplot(2,5,ii+1)
    plt.imshow(model.cluster_centers_[ii,:].reshape(28,28),cmap=cm.Greys,interpolation='none')
    plt.title('ii = {}'.format(ii))


This is better than I expected! Several of the centroids clearly correspond to recognizable digits. Some deficiencies remain: no centroid corresponds to the digit "5", two centroids correspond to the digit "0", and the centroids for "4", "7", and "9" look strongly similar.


In [70]:
# Plot the number of images assigned to each cluster
output = model.predict(digits)
height,left = np.histogram(output,bins=np.arange(11))
plt.bar(left[:-1],height)


Out[70]:
<Container object of 10 artists>

This histogram of cluster assignments reveals more shortcomings. Clusters 5 and 7, which both correspond to the digit "0", contain relatively few images. Cluster 1, on the other hand, seems to correspond to the digit "1", but comparing with the digit-frequency histogram above shows that many other digits must have been placed into that cluster.
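
For exact per-cluster counts (a small addition of mine), np.bincount is more direct than a binned histogram:

In [ ]:
# Exact number of images assigned to each of the 10 clusters
print np.bincount(output, minlength=10)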


In [83]:
# Visually associate the cluster centers with a digit from 0-9
def real2centroid(number):
    """Return the model centroid index associated with the real digit value."""
    # Note: no centroid resembles "5", so its entry here is only a placeholder
    realvalues = [7,1,0,6,2,5,8,4,9,3]
    return realvalues[number]
    
def centroid2real(number):
    """Return the real digit associated with a given centroid index."""
    # Centroids 5 and 7 both resemble "0"; no centroid resembles "5"
    centroids = [2,1,4,9,7,0,3,0,6,8]
    return centroids[number]
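
The tables above were filled in by eye from the centroid plots. As a cross-check (my addition, not in the original notebook), the same mapping can be recovered programmatically by assigning each cluster the most common true label among its members:

In [ ]:
# Most frequent true digit within each cluster (majority vote)
mapping = pd.crosstab(output, targets).idxmax(axis=1)
print mapping  # compare against the hand-coded centroid2real table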

In [81]:
# Convert the model output to predicted digit labels
output = model.predict(digits)
output = map(centroid2real,output)

In [82]:
# Calculate fraction correct
print 1-(sum((output-targets)!=0))/float(len(output))


0.595428571429
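
As a supplementary check (my addition), scikit-learn also provides label-invariant clustering metrics such as the adjusted Rand index, which scores the agreement between clusters and true labels without any manual mapping:

In [ ]:
import sklearn.metrics as skm
# 1.0 = perfect agreement with the true labels, ~0.0 = chance level
print skm.adjusted_rand_score(targets, model.predict(digits))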

4. Summary

A very simple k-means run placed handwritten digits into 10 clusters without using any labels. In many cases, these clusters clearly corresponded to actual digits. After visualizing the cluster centers and assigning each one a digit from 0-9, we found that the algorithm correctly categorizes roughly 60% of the images. That's rather impressive for a straightforward application of an unsupervised learning algorithm!